The effect of alternative tree representations on tree bank grammars



Mark Johnson* 

Cognitive and Linguistic Sciences, Box 1978 
Brown University 
Providence, RI 02912, USA 
Mark_Johnson@Brown.edu



Abstract 

The performance of PCFGs estimated from 
tree banks is sensitive to the particular way 
in which linguistic constructions are repre- 
sented as trees in the tree bank. This paper 
presents a theoretical analysis of the effect of 
different tree representations for PP attach- 
ment on PCFG models, and introduces a new 
methodology for empirically examining such 
effects using tree transformations. It shows
that one transformation, which copies the label
of a parent node onto the labels of its children,
can improve the performance of a PCFG
model in terms of labelled precision and recall 
on held out data from 73% (precision) and 
69% (recall) to 80% and 79% respectively. It 
also points out that if only maximum likeli- 
hood parses are of interest then many produc- 
tions can be ignored, since they are subsumed 
by combinations of other productions in the 
grammar. In the Penn II tree bank gram- 
mar, almost 9% of productions are subsumed 
in this way. 



1 Introduction 

Parsers which are capable of analysing unre- 
stricted text are of considerable scientific inter- 
est, and have technological applications in ar- 
eas such as machine translation and informa- 
tion retrieval as well. One way to produce such 
a parser is to extract a grammar from one of the 
larger tree bank corpora currently available. 

The relative frequency estimator described
below provides a simple way to estimate from
a tree bank corpus a Probabilistic Context
Free Grammar (PCFG) that generates Part Of
Speech (POS) tags. Such a PCFG induced from
a sufficiently large corpus typically generates all
possible POS tag strings. A parsing system can
be obtained by using a parser to find the maximum
likelihood parse tree for an input string.
Such parsing systems often perform as well as
other broad coverage parsing systems for predicting
tree structure from POS tags (Charniak, 1996).
In addition, many more sophisticated
parsing models are elaborations of such
PCFG models, so understanding the properties
of PCFGs is likely to be useful (Charniak, 1997;
Collins, 1997).

* I would like to thank Chris Manning, whose observation
that PCFG parsers do not accurately reproduce PP
attachment preferences in their training data stimulated
this work, as well as Eugene Charniak, Stuart Geman
and our students at Brown.



It is well-known that natural language exhibits
dependencies that Context Free Grammars
(CFGs), and hence PCFGs, cannot describe
(Shieber, 1985). But as explained below,
the independence assumptions implicit in
PCFGs introduce biases in the statistical model 
induced from a tree bank even in constructions 
which are adequately described by a CFG. The 
direction and size of these biases depend on fac- 
tors such as the following: 

• the precise tree structures used in the tree 
bank, and 

• whether the set of well-formed trees accord- 
ing to the linguistic model used to assign 
trees to strings can be described with a 
CFG. 

This paper explains how such biases can arise, 
and presents a series of experiments in which the 
trees of a tree bank corpus are systematically 
transformed to other tree structures to obtain a 
grammar used for parsing, and the inverse tree 



transform is applied to the structures produced 
using this grammar before evaluation. One of 
the transformations described here improves the 
average labelled precision and recall on held out 
data from 73% (precision) and 69% (recall) to 
80% and 79% respectively. 

2 Probabilistic Context Free
Grammars

A PCFG is a CFG in which each production
$A \to \alpha$ in the grammar's set of productions $R$ is
associated with an emission probability
$P(A \to \alpha)$ that satisfies a normalization constraint

$$\sum_{\alpha : A \to \alpha \in R} P(A \to \alpha) = 1$$

and a consistency or tightness constraint not
discussed here.

A PCFG defines a probability distribution
over the (finite) parse trees generated by the
grammar, where the probability of a tree $\tau$ is
given by

$$P(\tau) = \prod_{A \to \alpha \in R} P(A \to \alpha)^{C_\tau(A \to \alpha)}$$

where $C_\tau(A \to \alpha)$ is the 'count' of the local
tree consisting of a parent node labelled $A$ with
a sequence of immediate children nodes labelled
$\alpha$ in $\tau$, or equivalently, the number of times the
production $A \to \alpha$ is used in the derivation $\tau$.

The PCFG which assigns maximum likelihood
to a tree bank corpus $\tilde{\tau}$ is given by the
relative frequency estimator.

$$\hat{P}_{\tilde{\tau}}(A \to \alpha) = \frac{C_{\tilde{\tau}}(A \to \alpha)}{\sum_{\alpha' : A \to \alpha' \in R} C_{\tilde{\tau}}(A \to \alpha')}$$

Here $C_{\tilde{\tau}}(A \to \alpha)$ refers to the 'count' of the
local tree in the tree bank, or equivalently, the
number of times the production $A \to \alpha$ would
be used in derivations of exactly the trees in $\tilde{\tau}$.
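As a rough illustration of the relative frequency estimator, the following Python sketch counts local trees in a toy corpus and normalizes the counts per parent label. The tree encoding (a `(label, children)` pair whose leaves are POS tags) and the function names are assumptions made for this example, not part of the original experiments.

```python
from collections import Counter, defaultdict
from math import prod  # available from Python 3.8

# Assumed encoding: a tree is (label, children); POS tags are leaves
# with an empty child list, since the PCFG generates POS tag strings.

def productions(tree):
    """Yield (parent, child_labels) for every local tree with children."""
    label, children = tree
    if children:
        yield label, tuple(child[0] for child in children)
        for child in children:
            yield from productions(child)

def relative_frequency_pcfg(treebank):
    """Estimate P(A -> alpha) as count(A -> alpha) / count(A -> anything)."""
    counts = Counter(p for tree in treebank for p in productions(tree))
    totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in counts.items()}

def tree_probability(tree, pcfg):
    """Product of the probabilities of the productions used in the tree."""
    return prod(pcfg[p] for p in productions(tree))

# Toy corpus: one NP-attachment tree and one VP-attachment tree.
V, DET, N, P = ("V", []), ("Det", []), ("N", []), ("P", [])
NP = ("NP", [DET, N])
PP = ("PP", [P, NP])
np_attach = ("VP", [V, ("NP", [NP, PP])])
vp_attach = ("VP", [V, NP, PP])

pcfg = relative_frequency_pcfg([np_attach, vp_attach])
print(pcfg[("NP", ("Det", "N"))])
print(tree_probability(np_attach, pcfg))
```

With one tree of each kind, NP → Det N is used four times and NP → NP PP once, so the estimator gives them probabilities 4/5 and 1/5.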

It is practical to induce PCFGs from tree 
bank corpora and find maximum likelihood 
parses for such PCFGs using relatively modest 
computing equipment. All the experiments re- 
ported here used the Penn II Wall Street Jour- 
nal (WSJ) corpus, modified as described by 
Charniak (Charniak, 1996), i.e., empty nodes 



were deleted, and all other components of node
labels except syntactic category were removed.

Grammar induction or training used the 
39,832 trees in the F2-21 sections of the Penn II 
WSJ corpus, and testing was performed on the 
1,576 sentences of length 40 or less of the F22 
section of the corpus. Parsing was performed 
using an exhaustive CKY parser that returned 
a maximum likelihood parse. Ties between 
equally likely parses were broken randomly; on 
the tree bank grammar this leads to fluctuations 
in labelled precision and recall with a standard 
deviation of approximately 0.07%. 

3 Different tree structure
representations of adjunction

There is considerable variation in the tree struc- 
tures used in the linguistic literature to repre- 
sent various linguistic constructions. In this pa- 
per we focus on variations in the representation 
of adjunction constructions, particularly PP
adjunction, but similar variation occurs in other
constructions as well.

Early analyses in transformational grammar 
typically adopted a 'flat' representation of ad- 
junction structures in which adjuncts are repre- 
sented as siblings of the phrasal head, as shown 
in Figure 1. This representation does not systematically
distinguish between adjuncts and
arguments, as both are attached as children of 
a single maximal projection. 

The Penn II tree bank represents PP adjunc- 
tion to VP in this manner, presumably because 
it permits the annotators to avoid having to de- 
termine whether the PP in question is an ad- 
junct or an argument. 

Because this representation attaches all of the 
adjuncts modifying the same phrase to the same 
node, distinct CFG productions are required for 
each possible number of adjuncts. Thus the set 
of all possible trees following this representa- 
tion scheme can only be generated by a CFG 
if one imposes an upper bound on the number 
of PPs that can be adjoined to any one single 
phrase, but according to standard linguistic wis- 
dom there is no natural bound on the number 
of PPs that may be adjoined to a single phrase. 



[VP [V ate] [NP her dinner] [PP with a fork]]

[VP [V ate] [NP her dinner] [PP with a fork] [PP at the table]]

Figure 1: 'Flat' attachment representations of adjunction, where adjuncts are attached as siblings
of a lexical head (in this case, the verb ate). The Penn II tree bank represents VP adjunction in
this manner.



Later transformational analyses adopted the
more complex 'Chomsky adjunction' representation
of adjunction structures for theory-internal
reasons (e.g., it was a corollary of
Emonds' "Structure Preserving Hypothesis").
This representation provides an additional level
of recursive phrasal structure for each adjunct,
as depicted in Figure 2.

Modern transformational grammar, following
Chomsky's X' theory of phrase structure, represents
adjunction with similar recursive structures;
the major difference is that the non-maximal
phrasal nodes are given a new, distinct
category label.

Because the Chomsky adjunction structure 
and the X' theory based on it use a single rule 
to recursively adjoin an arbitrary number of ad- 
juncts, the set of all tree structures required by 
this representation scheme can be generated by 
a CFG. 

The Penn II tree bank uses a mixed kind of
representation for NP adjunction, involving two
levels of phrasal structure irrespective of the
number of adjuncts, as shown in Figure 3. This
representation permits adjuncts to be systematically
distinguished from arguments, although
this does not seem to have been done systematically
in the Penn II corpus.¹ Just as with the
'flat' representation, the set of all possible trees
required by this mixed representation cannot be
generated by a CFG unless the number of PPs
adjoined to a single phrase is bounded.

¹The tree annotation conventions used in the Penn II
corpus are described in detail in (Bies et al., 1995). The
mixed representation arises from the fact that "postmodifiers
are Chomsky-adjoined to the phrase they modify"
with the proviso that "consecutive unrelated adjuncts
are non-recursively attached to the NP they modify".
However, because constructions such as appositives, emphatic
reflexives and phrasal titles are associated with
their own level of NP structure, it is possible for NPs
with more than two levels of structure to appear.

[NP [NP [NP [N letters]] [PP to the agency]] [PP about the rules]]

Figure 4: A tree structure generated by any
PCFG that generates the trees in Figure 3, yet it
does not fit the general representational scheme
for adjunction structures used in the Penn II
tree bank.

Perhaps more seriously for PCFG modelling
of such tree structures, a PCFG which can generate
a nontrivial subset of such 'two level' NP
tree structures will also generate tree structures
which are not instances of this representational
scheme. For example, the NP production
needed to produce the leftmost tree in Figure 3
can apply recursively, generating an alternative
tree structure for the yield of the rightmost tree
of Figure 3, as shown in Figure 4. It is not clear
what interpretation to give tree structures such
as these, as they do not fit the chosen representational
scheme for adjunction structures.

4 PCFG models of PP adjunction 

This section presents a theoretical investigation 
into the effect of different tree representations 
on the performance of PCFG models of PP ad- 
junction. The analysis of four different models 
is presented here. 

Clearly actual tree bank data is far more complicated
than the simple models investigated in
this section, and the next section investigates
the effects of different tree representations
empirically by applying tree transformations to the
Penn II tree bank representations. However, the
theoretical models discussed in this section show
clearly that the choice of tree representation can
in principle affect the generalizations made by
a PCFG model.

[VP [VP [V ate] [NP her dinner]] [PP with a fork]]

[VP [VP [VP [V ate] [NP her dinner]] [PP with a fork]] [PP at the table]]

Figure 2: 'Chomsky adjunction' representations of adjunction, where each adjunct is attached
as the unique sibling of a phrasal node (in this case, VP). Chomsky's X' theory, used by modern
transformational grammar, analyses adjunction in a structurally similar way, except that the non-maximal
(in these examples, non-root) phrasal nodes are given a new category label (in this case
V').

[NP [NP [N letters]] [PP to the agency] [PP about the rules]]

Figure 3: The representation of NP adjunction used in the Penn II tree bank, where adjuncts are
attached as siblings of a single NP node.

4.1 The Penn II tree bank
representations

Suppose we train a PCFG on a corpus $\tilde{\tau}_1$ consisting
only of two different tree structures: the
NP attachment structure labelled (A1) and the
VP attachment tree labelled (B1).

(A1) [VP V [NP [NP Det N] [PP P [NP Det N]]]]

(B1) [VP V [NP Det N] [PP P [NP Det N]]]

Let $f$ be the relative frequency of (A1) in this
corpus, so that (B1) has relative frequency $1 - f$.

In the Penn II tree bank, structure (A1) occurs
7,033 times in the F2-21 subcorpora and
279 times in the F22 subcorpus, and structure
(B1) occurs 7,717 times in the F2-21 subcorpora
and 299 times in the F22 subcorpus.
Thus $f \approx 0.48$ in both the F2-21 subcorpora
and the F22 corpus.

Returning to the theoretical analysis, the relative
frequency counts $C_1$ and the non-unit production
probability estimates $P_1$ for the PCFG
induced from this two-tree corpus are as follows:

R                 $C_1(R)$    $P_1(R)$
VP → V NP         $f$         $f$
VP → V NP PP      $1 - f$     $1 - f$
NP → Det N        $2$         $2/(2 + f)$
NP → NP PP        $f$         $f/(2 + f)$

Of course, in a real tree bank the counts of
all these productions would also include their
occurrences in other constructions, so the theoretical
analysis presented here is a crude idealization.

Thus the estimated likelihoods using $P_1$ of the
tree structures (A1) and (B1) are:

$$P_1(A_1) = \frac{4f^2}{(2 + f)^3}, \qquad P_1(B_1) = \frac{4(1 - f)}{(2 + f)^2}$$

Clearly $P_1(A_1) < f$ and $P_1(B_1) < (1 - f)$
except at $f = 0$ and $f = 1$, so in general the
estimated frequencies using $P_1$ differ from the
frequencies of (A1) and (B1) in the training corpus.
This is not too surprising, as the PCFG $P_1$ assigns
non-zero probability to trees not in the
training corpus. For example, $P_1$ assigns non-zero
probability to the tree in Figure 4. We
discuss the ramifications of this in section 6.

In any case, in the parsing applications men- 
tioned earlier the absolute magnitude of the 
probability of a tree is not of direct interest; 
rather we are concerned with its probability rel- 
ative to the probabilities of other, alternative 
tree structures. Thus it is arguably more rea- 
sonable to ignore the "spurious" tree structures 
generated by Pi but not present in the train- 
ing corpus, and compare the estimated relative 
frequencies of (Ai) and (Bi) under Pi to their 
frequencies in the training data. 

Ideally the estimated relative frequency $f_1$ of
(A1)

$$f_1 = P_1(\tau = A_1 \mid \tau \in \{A_1, B_1\}) = \frac{P_1(A_1)}{P_1(A_1) + P_1(B_1)} = \frac{f^2}{2 - f}$$

will be close to its actual frequency $f$ in the
training corpus. The relationship between $f$
and $f_1$ is plotted in Figure 5. The value of $f_1$
can diverge substantially from $f$. For example,
at $f = 0.48$ (the estimate obtained from the
Penn II corpus presented above) $f_1 = 0.15$.

Figure 5: The estimated normalized frequency
$\hat{f}$ of NP attachment using the PCFG models
discussed in the text as a function of the relative
frequency $f$ of NP attachment in the training
data.
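The value quoted for $f = 0.48$ can be checked numerically. The sketch below multiplies out the production probabilities of the two-tree model of this subsection and renormalizes over the two trees; the function name `f1_from_productions` is a hypothetical helper introduced only for this illustration.

```python
def f1_from_productions(f):
    """Estimated relative frequency of NP attachment under the two-tree model."""
    # Production probabilities of the idealized two-tree corpus.
    p_vp_v_np = f                  # VP -> V NP
    p_vp_v_np_pp = 1 - f           # VP -> V NP PP
    p_np_det_n = 2 / (2 + f)       # NP -> Det N
    p_np_np_pp = f / (2 + f)       # NP -> NP PP
    # Each tree uses NP -> Det N twice; PP -> P NP has probability 1.
    p_a1 = p_vp_v_np * p_np_np_pp * p_np_det_n ** 2
    p_b1 = p_vp_v_np_pp * p_np_det_n ** 2
    return p_a1 / (p_a1 + p_b1)

print(round(f1_from_productions(0.48), 2))
```

The result agrees with the closed form $f^2/(2 - f)$, which is about 0.15 at $f = 0.48$.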

4.2 'Chomsky adjunction' 
representations 

Now suppose that the corpus contains the following
two trees (A2) and (B2), which are the
'Chomsky adjunction' representations of NP
and VP attached PPs respectively, with relative
frequencies $f$ and $(1 - f)$ as before.

(A2) [VP V [NP [NP Det N] [PP P [NP Det N]]]]

(B2) [VP [VP V [NP Det N]] [PP P [NP Det N]]]

The counts $C_2$ and the non-unit production
probability estimates $P_2$ for the PCFG induced
from this two-tree corpus are as follows:

R                 $C_2(R)$    $P_2(R)$
VP → V NP         $1$         $1/(2 - f)$
VP → VP PP        $1 - f$     $(1 - f)/(2 - f)$
NP → Det N        $2$         $2/(2 + f)$
NP → NP PP        $f$         $f/(2 + f)$

The estimated likelihoods using $P_2$ of the tree
structures (A2) and (B2) are:

$$P_2(A_2) = \frac{4f}{(4 - f^2)(2 + f)^2}, \qquad P_2(B_2) = \frac{4(1 - f)}{(4 - f^2)^2}$$

As in the previous subsection $P_2(A_2) < f$ and
$P_2(B_2) < (1 - f)$ because the PCFG assigns
non-zero probability to trees not in the training
corpus. Again, we calculate the estimated
relative frequencies of (A2) and (B2) under $P_2$.

$$f_2 = P_2(\tau = A_2 \mid \tau \in \{A_2, B_2\}) = \frac{f^2 - 2f}{2f^2 - f - 2}$$

The relationship between $f$ and $f_2$ is plotted
in Figure 5. The value of $f_2$ can diverge from
$f$, although not as widely as $f_1$. For example,
at $f = 0.48$, $f_2 = 0.36$. Thus the precise tree
structure representations used to train a PCFG
can have a marked effect on its performance.

4.3 Penn II representations with 
parent annotation 

One of the weaknesses of a PCFG is that it 
is insensitive to non-local relationships between 
nodes. If these relationships are significant then 
a PCFG will be a poor language model. Indeed, 
the sense in which the set of trees generated by 
a CFG is "context free" is precisely that the la- 
bel on a node completely characterizes the rela- 
tionships between the subtree dominated by the 
node and the set of nodes that properly domi- 
nate this subtree. 

Thus one way of relaxing the independence 
assumptions implicit in a PCFG model is to 
systematically encode more information in node 
labels about their context. This subsection ex- 
plores a particularly simple kind of contextual 
encoding: the label of the parent of each non- 
root nonpreterminal node is appended to that 



node's label. The labels of the root node and 
the terminal and preterminal nodes are left un- 
changed. 
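The relabelling just described can be sketched in a few lines of Python. The tree encoding (a `(label, children)` pair whose leaves are POS tags) and the use of `^` as a separator that never occurs inside an original label are assumptions of this illustration, not details confirmed by the source.

```python
def annotate(tree, parent=None):
    """Append the parent's original label to each non-root, non-preterminal node."""
    label, children = tree
    if not children:               # preterminal (POS tag): left unchanged
        return tree
    new_label = label if parent is None else f"{label}^{parent}"
    # Children see the *original* label of this node, not the annotated one.
    return (new_label, [annotate(child, label) for child in children])

def deannotate(tree):
    """Inverse transform: strip everything from the first '^' onwards."""
    label, children = tree
    return (label.split("^")[0], [deannotate(child) for child in children])

# NP-attachment tree dominated by S.
V, DET, N, P = ("V", []), ("Det", []), ("N", []), ("P", [])
NP = ("NP", [DET, N])
np_attach = ("S", [("VP", [V, ("NP", [NP, ("PP", [P, NP])])])])

annotated = annotate(np_attach)
print(annotated)
```

Because `deannotate` only strips suffixes, applying it to the output of `annotate` recovers the input tree exactly, which is what the evaluation procedure needs from an inverse transform.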

For example, assuming that the Penn II format
trees (A1) and (B1) of subsection 4.1 are
immediately dominated by a node labelled S,
this relabelling applied to those trees produces
the trees (A3) and (B3) below.

(A3) [VP^S V [NP^VP [NP^NP Det N] [PP^NP P [NP^PP Det N]]]]

(B3) [VP^S V [NP^VP Det N] [PP^VP P [NP^PP Det N]]]

We can perform the same theoretical analy- 
sis on this two tree corpus that we applied to 
the previous corpora to investigate the effect of 
this relabelling on the PCFG modelling of PP 
attachment structures. 

The counts $C_3$ and the non-unit production
probability estimates $P_3$ for the PCFG induced
from this two-tree corpus are as follows:

R                            $C_3(R)$    $P_3(R)$
VP^S → V NP^VP               $f$         $f$
VP^S → V NP^VP PP^VP         $1 - f$     $1 - f$
NP^VP → Det N                $1 - f$     $1 - f$
NP^VP → NP^NP PP^NP          $f$         $f$

The estimated likelihoods using $P_3$ of the tree
structures (A3) and (B3) are:

$$P_3(A_3) = f^2, \qquad P_3(B_3) = (1 - f)^2$$

As in the previous subsection $P_3(A_3) < f$ and
$P_3(B_3) < (1 - f)$. Again, we calculate the estimated
relative frequencies of (A3) and (B3)
under $P_3$.

$$f_3 = P_3(\tau = A_3 \mid \tau \in \{A_3, B_3\}) = \frac{f^2}{f^2 + (1 - f)^2}$$

The relationship between $f$ and $f_3$ is plotted
in Figure 5. The value of $f_3$ can diverge from
$f$, just like the other estimates. For example,
at $f = 0.48$, $f_3 = 0.46$. Thus as expected, increased
context information in the form of an
enriched node labelling scheme can markedly
change PCFG modelling performance.
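To compare the three models side by side, the following sketch tabulates the estimated relative frequency of NP attachment for a few training frequencies. It assumes the closed forms $f^2/(2 - f)$, $(f^2 - 2f)/(2f^2 - f - 2)$ and $f^2/(f^2 + (1 - f)^2)$ that follow from the production probabilities of subsections 4.1 to 4.3; the function names are illustrative only.

```python
def f1(f):
    """Penn II representation (subsection 4.1)."""
    return f ** 2 / (2 - f)

def f2(f):
    """Chomsky adjunction representation (subsection 4.2)."""
    return (f ** 2 - 2 * f) / (2 * f ** 2 - f - 2)

def f3(f):
    """Penn II representation with parent annotation (subsection 4.3)."""
    return f ** 2 / (f ** 2 + (1 - f) ** 2)

print("   f    f1    f2    f3")
for f in (0.10, 0.25, 0.48, 0.75, 0.90):
    print(f"{f:4.2f}  {f1(f):4.2f}  {f2(f):4.2f}  {f3(f):4.2f}")
```

At $f = 0.48$ the three estimates round to 0.15, 0.36 and 0.46, matching the values quoted in the text; only the parent-annotated model stays close to the training frequency there.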

5 Tree transformations 

The last section presented simplified theoretical 
analyses of the effect of variation in tree rep- 
resentation and node labelling on PCFG mod- 
elling of PP attachment preferences. This sec- 
tion reports the results of an empirical investi- 
gation into the effect of changes in tree repre- 
sentation. These experiments were conducted 
by: 

1. systematically transforming the trees in the
training corpus F2-21 by applying a tree
transform $X$,

2. inducing a PCFG $G_X$ from the transformed
F2-21 trees,

3. finding the maximum likelihood parses
$Y(\tilde{\tau})_X$ of the yield of each sentence in the
F22 corpus with respect to the PCFG $G_X$,

4. applying the inverse transform $X^{-1}$ to
these maximum likelihood parse trees
$Y(\tilde{\tau})_X$ to yield a sequence of 'detransformed'
trees $X^{-1}(Y(\tilde{\tau})_X)$ using (approximately)
the same representational system
as the tree bank itself, and

5. evaluating the detransformed trees
$X^{-1}(Y(\tilde{\tau})_X)$ with the standard labelled
precision and recall measures.

Statistics were also collected on the properties
of the grammar $G_X$ and its detransformed maximum
likelihood parses $X^{-1}(Y(\tilde{\tau})_X)$; the full
results are presented in Table 1.

The columns of that table correspond to dif- 
ferent sequences of trees as follows. 



F22: the trees from the F22 subcorpus of the 
Penn II tree bank, 

F22 Id: the maximum likelihood parses of 
the yields of the F22 subcorpus using the 
PCFG estimated from the F22 subcorpus 
itself, 

Id: the maximum likelihood parses of the 
yields of the F22 subcorpus using the 
PCFG estimated from the F2-21 subcor- 
pus (i.e., this corresponds to applying an 
identity transform). 

Parent: as above, except that the parent
annotation transform described in subsection 4.3
was used in training and evaluation,

VP: as in Id, except that the flat VP structures 
used in the Penn II tree bank were trans- 
formed into recursive Chomsky adjunction 
structures as described below, 

NP: as above, except that the one-level NP 
structures used in the Penn II tree bank 
were transformed into recursive Chomsky 
adjunction structures, and 

VP-NP: as above, except that both NP and VP 
structures were transformed into recursive 
Chomsky adjunction structures. 

The F22 tree sequence column provides infor- 
mation on the distribution of subtrees in the test 
tree sequence itself. The F22 Id PCFG gives 
data on the case where the PCFG is trained on 
the same data that it is evaluated on, namely 
the F22 subcorpus. This column is included be- 
cause it is often assumed that the performance 
of such a model is a reasonable upper bound 
on what can be expected from models induced 
from training data distinct from the test data. 

The remaining columns describe PCFGs in- 
duced from versions of the F2-21 subcorpora ob- 
tained by applying tree transformations in the 
manner described above. 

                   F22    F22 Id   Id       Parent   VP       NP       VP-NP
No. of rules              2,269    14,962   22,773   14,393   14,866   14,297
Precision          1      0.772    0.735    0.801    0.722    0.738    0.730
Recall             1      0.728    0.696    0.793    0.677    0.698    0.705
NP attachments     279             67       217               51       329
VP attachments     299    424      384      350      240      427
NP* attachments    339    3        67       234      3        61       401
VP* attachments    412    668      663      461      493      650      151

Table 1: The results of an empirical study of the effect of tree structure on PCFG models. Each
column corresponds to the sequence of trees, either consisting of the F22 subcorpus or transforms of
the maximum likelihood parses of the yields of the F22 subcorpus with respect to different PCFGs,
as explained in the text. The first row reports the number of productions in these PCFGs, and
the next two rows give the labelled precision and recall of these sequences of trees. The last four
rows report the number of times particular kinds of subtrees appear in these sequences of trees, as
explained in the text.

The VP transform is the result of exhaustively
applying the tree transforms below. The
first transform transforms VP expansions with
final PPs into Chomsky adjunction structures,
and the second transform adjoins final PPs with
a following comma punctuation into Chomsky
adjunction structures. In both cases it is required
that the 'lowered' sequence of subtrees
α be of length 2 or greater. This ensures that
the transforms will only apply a finite number
of times. These two rules have the effect of converting
VP final PPs into Chomsky adjunction
structures.

[VP α PP]  ⇒  [VP [VP α] PP]

The second rule is the same except that the
final PP is followed by a comma.

The NP transform is similar to the VP transform.
It too is the result of exhaustively applying
two tree transformation rules. These have
the effect of converting NP final PPs into Chomsky
adjunction structures. In this case, we require
that α be of length 1 or greater.

[NP NP α PP]  ⇒  [NP [NP NP α] PP]

As with the VP transform, the second rule is
the same except that the final PP is followed
by a comma.

The VP-NP transform is the result of applying
all four of the above tree transforms.
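The first VP rule and an approximate inverse can be sketched as follows. This is only an illustration under stated assumptions: trees are `(label, children)` pairs, the comma variant of the rule is omitted, and `chomsky_adjoin` and `flatten` are hypothetical names rather than the code used in the experiments.

```python
def _lower(label, children, min_len):
    """Peel final PPs off one node while the lowered sequence stays long enough."""
    if len(children) > min_len and children[-1][0] == "PP":
        inner = _lower(label, children[:-1], min_len)
        return (label, [inner, children[-1]])
    return (label, children)

def chomsky_adjoin(tree, label="VP", min_len=2):
    """Exhaustively rewrite [label alpha PP] as [label [label alpha] PP]."""
    lab, children = tree
    children = [chomsky_adjoin(child, label, min_len) for child in children]
    if lab == label:
        return _lower(lab, children, min_len)
    return (lab, children)

def flatten(tree, label="VP"):
    """Approximate inverse: splice [label [label alpha] PP] back to [label alpha PP]."""
    lab, children = tree
    children = [flatten(child, label) for child in children]
    if (lab == label and len(children) == 2
            and children[0][0] == label and children[1][0] == "PP"):
        children = children[0][1] + [children[1]]
    return (lab, children)

# Flat VP with two final PPs (subtrees below NP and PP elided for brevity).
flat = ("VP", [("V", []), ("NP", []), ("PP", []), ("PP", [])])
adjoined = chomsky_adjoin(flat)
print(adjoined)
print(flatten(adjoined) == flat)
```

The inverse is only approximate in general: `flatten` also splices any VP over VP PP structure that was already present before the transform, which is consistent with the detransformed trees using "(approximately)" the tree bank's own representation.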

The rows of Table 1 provide descriptions of
these tree sequences (after 'untransformation', 
as described above) and, if appropriate, the 
PCFGs that generated them. 

The labelled precision and recall figures are
obtained by regarding a sequence of trees $\tilde{\tau}$ as
a multiset or bag $E(\tilde{\tau})$ of edges, i.e., triples
$(N, l, r)$ where $N$ is a nonterminal label and $l$
and $r$ are left and right string positions in the yield
of the entire corpus. (Root nodes and preterminal
nodes are ignored in these edge sets, as
they are given as input to the parser.) Relative
to a 'test sequence' of trees $\tilde{\tau}'$ (here the F22
subcorpus) the labelled precision and recall of
a sequence of trees $\tilde{\tau}$ with the same yield as $\tilde{\tau}'$
are calculated as follows:

$$\mathrm{Precision}(\tilde{\tau}) = \frac{|E(\tilde{\tau}) \cap E(\tilde{\tau}')|}{|E(\tilde{\tau})|}, \qquad \mathrm{Recall}(\tilde{\tau}) = \frac{|E(\tilde{\tau}) \cap E(\tilde{\tau}')|}{|E(\tilde{\tau}')|}$$

(The '$\cap$' operation above refers to multiset
intersection.) Precision is the fraction of edges
in the tree sequence to be evaluated which also
appear in the test sequence, and recall is the
fraction of edges in the test sequence which also
appear in the sequence to be evaluated.
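A minimal sketch of these measures, assuming trees encoded as `(label, children)` pairs whose leaves are POS tags and using `collections.Counter` for the multisets of edges. The function names are hypothetical and this is not the evaluation code used for Table 1.

```python
from collections import Counter

def edges(trees):
    """Multiset of (label, left, right) edges over a whole tree sequence.

    String positions run over the yield of the entire sequence; root nodes
    and preterminal (leaf) nodes are skipped.
    """
    bag = Counter()
    pos = 0

    def walk(tree, is_root):
        nonlocal pos
        label, children = tree
        if not children:           # preterminal: consumes one position
            pos += 1
            return
        left = pos
        for child in children:
            walk(child, False)
        if not is_root:
            bag[(label, left, pos)] += 1

    for tree in trees:
        walk(tree, True)
    return bag

def precision_recall(candidate, gold):
    """Labelled precision and recall of candidate trees against gold trees."""
    e_cand, e_gold = edges(candidate), edges(gold)
    overlap = sum((e_cand & e_gold).values())   # multiset intersection
    return overlap / sum(e_cand.values()), overlap / sum(e_gold.values())

V, DET, N, P = ("V", []), ("Det", []), ("N", []), ("P", [])
NP = ("NP", [DET, N])
PP = ("PP", [P, NP])
gold = [("S", [("VP", [V, ("NP", [NP, PP])])])]     # NP attachment
candidate = [("S", [("VP", [V, NP, PP])])]          # VP attachment

precision, recall = precision_recall(candidate, gold)
print(precision, recall)
```

In this toy case the candidate's four edges all occur in the gold tree, while the gold tree has one extra NP edge spanning the attached PP, so precision is 1.0 and recall is 0.8. The sketch assumes both sequences contain at least one edge.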

The rows labelled NP attachments and VP 
attachments provide the number of times the 
following tree schema, which represent a single 
PP attachment, match the tree sequence.² In
these schemas, V can be instantiated by any of
the verbal preterminal tags used in the Penn II
corpus.

NP attachment: [VP V [NP NP PP]]

VP attachment: [VP V NP PP]



The rows labelled NP* attachments and VP* 
attachments provide the number of times that 
the following more relaxed schema match the 
tree sequence. Here a can be instantiated by 
any sequence of trees, and V can be instantiated 
by the same range of preterminal tags as above. 

[Figure: the relaxed NP* and VP* attachment schemas, in which α stands for an arbitrary sequence of trees.]



As expected, the PCFG based on the Parent 
transformation, which copies the label of each 
parent node onto those of its children, outper- 
forms all other PCFGs in terms of labelled pre- 
cision and recall. 




²The Penn II markup scheme permits a 'pseudo-attachment'
notation for indicating ambiguous attachment.
However, this is used only relatively
infrequently (the pseudo-attachment markup appears
only 27 times in the entire Penn II tree bank) and was
ignored here. Pseudo-attachment structures count as VP
attachment structures here.



The various adjunction transformations had
only a minimal effect on labelled precision and recall.
Perhaps this is because PP attachment
ambiguities, despite their important role in linguistic
and parsing theory, are just one source
of ambiguity among many in real language, so
the choice among the alternative representations
has only a minor effect.

Indeed, in some cases moving to the purportedly
more linguistically realistic Chomsky
adjunction tree representations actually decreased
performance on these measures. On reflection,
perhaps this should not be surprising.
The Chomsky adjunction representations are 
motivated within the theoretical framework of 
Transformational Grammar, which explicitly 
argues for nonlocal, indeed, non context free, 
dependencies. Thus its poor performance when 
used as input to a statistical model which is in- 
sensitive to such dependencies is to be expected. 
Indeed, it might be the case that the additional 
adjunction nodes inserted in the tree transfor- 
mations above have the effect of converting a 
local dependency (which can be described by a 
PCFG) into a nonlocal dependency (which can- 
not). 

Another initially surprising property of the 
tree sequences produced by the PCFGs is that 
they do not reflect at all well the frequency of 
the different kinds of PP attachment found in 
the Penn II corpus. This is in fact to be ex- 
pected, since the sequences consist of maximum 
likelihood parses. To see this, consider any of the 
examples analysed in section 4. In all of these
cases, the corpora contained two tree structures, 
and the induced PCFG associates each with an 
estimated likelihood. If these likelihoods differ, 
then a maximum likelihood parser will always 
return the same maximum likelihood tree struc- 
ture each time it is presented with its yield, and 
will never return the tree structure with lower 
likelihood, even though the PCFG assigns it a 
nonzero likelihood. 

Thus the surprising fact is that these PCFG 
parsers ever produce a nonzero number of NP 
attachments and VP attachments in the same 
tree sequence. This is possible because the node 
label V in the attachment schema above abbre- 



viates several different preterminal labels (i.e., 
the set of all verbal tags). Further investiga- 
tion shows that once the V label in NP and 
VP attachment schemas is instantiated with a 
particular verbal tag, only either the relevant 
NP attachment schema or the VP attachment 
schema appears in the tree sequence. For instance,
in the Id tree sequence (i.e., produced
by the standard tree bank grammar) the 67 NP
attachments all occurred with the V label instantiated
to the verbal tag AUX.³

6 Subsumed rules in tree bank 
grammars 



It was mentioned in subsection 4.1 that it is possible
for the PCFG induced from a tree bank to
generate trees that are not meaningful represen- 
tations with respect to the original tree bank 
representational scheme. The PCFG induced 
from the F2-21 subcorpus contains the follow- 
ing two productions: 

P(NP → NP PP) = 0.112
P(NP → NP PP PP) = 0.006

These productions generate the Penn II rep- 
resentations of one and two PP adjunctions to 
NP, as explained above. However, the second of 
these productions will never be used in a max- 
imum likelihood parse, as the parse of the se- 
quence NP PP PP involving two applications of 
the first rule has a higher estimated likelihood. 
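The comparison behind this claim is simple arithmetic. Both analyses of the sequence NP PP PP contain the same inner NP and the same two PP subtrees, so those factors cancel and only the NP productions at the top differ. The sketch below, with a hypothetical helper name, compares the two remaining factors using the probabilities quoted above.

```python
def preferred_np_pp_pp(p_np_pp, p_np_pp_pp):
    """Compare the flat analysis with two applications of NP -> NP PP."""
    flat = p_np_pp_pp              # one use of NP -> NP PP PP
    recursive = p_np_pp * p_np_pp  # two nested uses of NP -> NP PP
    if recursive > flat:
        return "recursive", recursive
    return "flat", flat

choice, likelihood = preferred_np_pp_pp(0.112, 0.006)
print(choice, likelihood)
```

Since 0.112 × 0.112 is roughly 0.0125, about twice 0.006, the nested analysis always wins and the flat production never appears in a maximum likelihood parse.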

In fact, all of the productions of the form
$\mathrm{NP} \to \mathrm{NP}\ \mathrm{PP}^n$ where $n > 1$ in the PCFG induced
from the F2-21 subcorpus are subsumed
by the NP → NP PP production in this way.
Thus PP adjunction to NP in the maximum
likelihood parses using this PCFG always appears
as Chomsky adjunction, even though the
original tree bank did not use this representational
scheme for adjunction!

In fact, a large number of productions in the 
PCFG induced from the F2-21 subcorpus are 
subsumed in this way. Of the 14,962 produc- 
tions in the PCFG, 1,327, or just under 9%, 
are subsumed by combinations of two or more 



productions. Since these productions are never 
used to construct a maximum likelihood parse, 
they can be ignored if only maximum likelihood 
parses are required. 

7 Conclusion 

There may be several ways of representing a 
particular linguistic construction as a tree. Be- 
cause of the independence assumptions implicit 
in a PCFG, the kind of tree representation 
employed can have a dramatic impact on the 
quality of the PCFG model induced. This
paper introduced a new methodology for examining
these effects utilizing tree transformations,
and showed that one transformation,
which copies the label of a parent node onto the 
labels of its children, can dramatically improve 
the performance of a PCFG model in terms of 
labelled precision and recall. It also pointed 
out that if only maximum likelihood parses are 
of interest then many productions can be ig- 
nored, since they are subsumed by combinations 
of other productions in the grammar.

³This tag was introduced by Charniak (1996) to distinguish
auxiliary verbs from main verbs.



